[KV Connector] Support using FlexKV as KV Cache Offloading option.#34328
vllm-bot merged 3 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. You can ask your reviewers to trigger select CI tests on top of those. Once the PR is approved and ready to go, your PR reviewer(s) can run the full CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Documentation preview: https://vllm--34328.org.readthedocs.build/en/34328/
Code Review
This pull request introduces support for FlexKV as a KV cache offloading option by adding a new FlexKVConnectorV1. The changes include the connector implementation, factory registration, documentation, and a new example script. The core connector is a thin wrapper around the flexkv library. The implementation looks good, but the example script prefix_caching_flexkv.py has a few issues that should be addressed to make it more robust and user-friendly. Specifically, it uses hardcoded paths for IPC and the model, which hurts portability, and relies on time.sleep() for synchronization, which is unreliable. My review provides suggestions to fix these issues.
```python
flexkv_config = {
    "server_recv_port": "ipc:///tmp/flexkv_test",
```
The hardcoded IPC path ipc:///tmp/flexkv_test can lead to conflicts if multiple instances of this example are run concurrently on the same machine. This could cause unexpected failures or race conditions. To ensure the example is robust and can be run in parallel, a unique path should be generated for each run, for example by appending the process ID.
Suggested change:

```diff
-    "server_recv_port": "ipc:///tmp/flexkv_test",
+    "server_recv_port": f"ipc:///tmp/flexkv_test_{os.getpid()}",
     "kv_role": "kv_both",
 }
```
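A variant of this idea is to add a random suffix alongside the PID, which also avoids collisions when PIDs are reused across runs. A small illustrative sketch (the helper name is hypothetical, not part of the example):

```python
import os
import uuid


def unique_ipc_path(prefix: str = "flexkv_test") -> str:
    """Build a per-run IPC endpoint under /tmp, combining the PID with a
    short random suffix so concurrent and successive runs never collide."""
    return f"ipc:///tmp/{prefix}_{os.getpid()}_{uuid.uuid4().hex[:8]}"
```

The resulting path can then be passed as `server_recv_port` in `flexkv_config`.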
```python
model_path = os.environ.get("MODEL_PATH", "/data0/models/Qwen3/Qwen3-32B")
```
The default model path "/data0/models/Qwen3/Qwen3-32B" is hardcoded to a specific developer's environment. This will cause the example to fail for other users who do not have this path. To make the example runnable for everyone, it's better to remove the default value and require the user to set the MODEL_PATH environment variable, raising an error if it's not set.
Suggested change:

```diff
-model_path = os.environ.get("MODEL_PATH", "/data0/models/Qwen3/Qwen3-32B")
+model_path = os.environ.get("MODEL_PATH")
+if model_path is None:
+    raise ValueError("Please set the MODEL_PATH environment variable.")
```
```python
prefix_cached_llm.generate(generating_prompts[0], sampling_params)

# wait for offload kv task finished.
time.sleep(2)
```
Using time.sleep(2) to wait for an asynchronous offload task to complete is unreliable. The task might take more or less time depending on system load and other factors. This can lead to flaky behavior in the example, where it might fail intermittently or have unnecessary delays. A more robust synchronization mechanism should be used. If the FlexKV API provides a way to block until the operation is complete (e.g., a future, an event, or a blocking call), it should be used instead. This would make the example more reliable and demonstrate better programming practices.
Absolutely right that time.sleep(2) is not a reliable synchronization mechanism. The offload task here is an asynchronous background operation managed internally by FlexKV. Unfortunately, the current LLM (offline inference) API does not expose a blocking interface to wait for the offload to complete — the synchronization (wait_for_save) happens internally at the worker level during the forward pass. In practice, 2 seconds is more than sufficient for the async KV transfer task to complete in typical scenarios, so this works reliably for the purpose of this example.
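If a future FlexKV release does expose a status-query API, the fixed sleep could be replaced with a bounded polling loop. A generic sketch of that pattern (the `offload_tasks_done` call in the usage comment is hypothetical, not a real FlexKV API):

```python
import time


def wait_until(predicate, timeout: float = 10.0, interval: float = 0.1) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical usage, if a completion check existed:
# finished = wait_until(lambda: connector.offload_tasks_done(), timeout=5.0)
```

This bounds the worst-case wait while returning as soon as the task is done, instead of always paying the full sleep.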
```python
print(f"Generated answers are the same: {generated_same}")

# wait for offload kv task finished.
time.sleep(2)
```
This is the same case as the previous comment — the 2-second sleep is intentional here for the same reason.
```python
    vllm_config=vllm_config, role=role, kv_cache_config=kv_cache_config
)
try:
    from flexkv.integration.vllm.vllm_v1_adapter import FlexKVConnectorV1Impl
```
Looking at the vLLM APIs used by this connector:
```python
# vllm
from vllm.distributed.kv_transfer.kv_connector.v1.base import (
    KVConnectorMetadata, KVConnectorRole)
from vllm.distributed.kv_transfer.kv_connector.v1.metrics import KVConnectorStats
from vllm.distributed.parallel_state import get_tp_group

if TYPE_CHECKING:
    from vllm.config import VllmConfig
    from vllm.v1.core.sched.output import SchedulerOutput
    from vllm.attention.backends.abstract import AttentionMetadata
    from vllm.distributed.kv_events import KVCacheEvent
    from vllm.forward_context import ForwardContext
    from vllm.v1.core.kv_cache_manager import KVCacheBlocks
    from vllm.v1.request import Request
    from vllm.v1.outputs import KVConnectorOutput
```
I don't think we (the vLLM project) commit to backwards compatibility with these APIs, so it's quite likely the connector will need to be updated as new vLLM releases come out
(This is really just an FYI - I don't have an alternate recommendation, and I can't even be confidently specific about which of these APIs we do expect to remain backwards compatible)
Thanks for the heads-up! We understand that these internal APIs are subject to change and vLLM does not guarantee backwards compatibility. To address this, we have added type checking and compatibility logic in the FlexKV-vLLM adapter code to better handle potential changes. We (the FlexKV team) will keep an eye on vLLM updates and maintain the connector to ensure compatibility with future releases.
You could also setup a test job in CI that installs FlexKV so we know when it breaks
> You could also setup a test job in CI that installs FlexKV so we know when it breaks
Thanks for the suggestion! We've added a CI workflow in the FlexKV repository to track vLLM API compatibility: .github/workflows/vllm-compat-test.yml
It runs on a daily schedule and covers three checks:

1. All vLLM internal symbols used by FlexKVConnectorV1 are still importable.
2. FlexKVConnectorV1 implements every abstract method required by KVConnectorBase_V1.
3. All overridden methods still exist in the base class.
We think it's more appropriate to maintain this CI test in the FlexKV repository itself rather than adding it to vLLM's CI, to avoid introducing external dependencies into vLLM's test pipeline.
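The three checks described above can be sketched with Python's importlib and abc machinery. This is an illustrative sketch, not FlexKV's actual CI code, and the function names here are made up for the example:

```python
import importlib
import inspect


def missing_symbols(module_path: str, names: list[str]) -> list[str]:
    """Return the names that can no longer be imported from module_path."""
    try:
        mod = importlib.import_module(module_path)
    except ImportError:
        return list(names)
    return [n for n in names if not hasattr(mod, n)]


def unimplemented_abstract_methods(impl_cls: type) -> set[str]:
    """Abstract methods the implementation class has not overridden.

    Relies on ABCMeta populating __abstractmethods__ on subclasses."""
    return set(getattr(impl_cls, "__abstractmethods__", frozenset()))


def overrides_missing_in_base(impl_cls: type, base_cls: type) -> list[str]:
    """Methods defined directly on impl_cls that no longer exist on base_cls."""
    return [
        name for name, _ in inspect.getmembers(impl_cls, inspect.isfunction)
        if name in vars(impl_cls) and not hasattr(base_cls, name)
    ]
```

A daily job running checks like these against the latest vLLM surfaces API drift as soon as it happens, without adding FlexKV as a dependency of vLLM's own test pipeline.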
Hi @feiqiangs, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
```text
1. git clone git@github.com:taco-project/FlexKV.git
2. cd FlexKV && bash build.sh
```
Could you consider pinning to a branch or commit? I'm not sure how active development is.
Thanks for the suggestion! We intentionally keep this as a clone of the main branch rather than pinning to a specific tag/commit, because FlexKV maintains a dedicated CI workflow (vllm-compat-test.yml) that continuously validates compatibility with vLLM. This ensures the main branch of FlexKV is always compatible with the corresponding vLLM version. We (the FlexKV team) are committed to keeping the main branch in a release-ready state at all times. Pinning to a specific commit would require frequent updates to this example whenever FlexKV releases bug fixes.
```text
Usage:
    1. Run this script:
        python examples/offline_inference/prefix_caching_flexkv.py
```
Seems you are missing the MODEL_PATH env set in your example. You could also just make it a CLI arg.
Thanks! I've updated the example to use --model as a required CLI argument instead of the MODEL_PATH env var. The usage is now:
```shell
python examples/offline_inference/prefix_caching_flexkv.py \
    --model /path/to/your/model
```
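A minimal argparse setup for such a required `--model` flag might look like the following. This is an illustrative sketch, not necessarily the exact code in the example:

```python
import argparse


def parse_args(argv=None):
    """Build the example's CLI; --model is required, so there is no
    environment-specific default path baked into the script."""
    parser = argparse.ArgumentParser(
        description="Prefix-caching example with FlexKV KV offloading.")
    parser.add_argument(
        "--model", required=True,
        help="Path to the model to load.")
    return parser.parse_args(argv)
```

With `required=True`, argparse exits with a usage error when the flag is missing, which replaces the manual `MODEL_PATH` check.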
```python
    from flexkv.integration.vllm.vllm_v1_adapter import FlexKVConnectorV1Impl
except ImportError as e:
    raise ImportError(
        "FlexKV is not installed. Please install it to use FlexKVConnectorV1. "
```
It would be best to share a link or instructions to install so the user can be guided properly.
Thanks for the feedback! I've updated the docstring to include both a link and step-by-step installation instructions. The current code now reads:
```text
Installation:
    See https://github.com/taco-project/FlexKV for installation instructions.

Quick start::

    git clone git@github.com:taco-project/FlexKV.git
    cd FlexKV && bash build.sh
```
Additionally, the ImportError message also directs users to the GitHub repo:
```python
raise ImportError(
    "FlexKV is not installed. Please install it to use "
    "FlexKVConnectorV1. See https://github.com/taco-project/FlexKV "
    "for installation instructions."
) from e
```

This should give users clear guidance.
NickLucche left a comment:
Apologies for the delay in getting back to you on this one @feiqiangs.
Do we have any tests that we can run along with this connector?
```python
            **kwargs (Any): additional arguments for the load operation

        Note:
            The number of elements in kv_caches and layer_names should be
            the same.
        """
        self._flexkv_connector.start_load_kv(forward_context, **kwargs)

    def wait_for_layer_load(self, layer_name: str) -> None:
        """
        Block until the KV for a specific layer is loaded into vLLM's
        paged buffer. This is called from within attention layer to ensure
        async copying from start_load_kv is complete.

        This interface will be useful for layer-by-layer pipelining.

        Args:
            layer_name: the name of that layer
        """
        self._flexkv_connector.wait_for_layer_load(layer_name)
```
Are you making use of both these paths here?
Usually connectors either load by layer blocking or load all layers async, but I don't see a flag to switch logic here.
Thanks for the question, and apologies for the confusing docstrings!
To clarify the design: FlexKV uses a scheduler-side transfer model (similar to NIXL's scheduler-side coordination) — all KV transfers are coordinated by the scheduler connector — build_connector_meta calls launch_tasks() to kick off async transfers, and update_connector_output polls query_finished_task() for completion. KV blocks are moved directly between the FlexKV server and vLLM's GPU memory without any worker-side intervention during the forward pass.
So to answer your question: we are not using both paths — start_load_kv, wait_for_layer_load, save_kv_layer, and wait_for_save are all currently no-ops. We keep them because:

1. `KVConnectorBase_V1` requires implementing these methods.
2. They serve as extension points for a potential future worker-side layer-pipelining optimisation.

We've updated the docstrings in the latest commit to make this explicit. Sorry for the confusion caused by the misleading comments!
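The division of labor described above can be sketched as follows. This is an illustrative sketch of the scheduler-side pattern, not the actual FlexKV implementation; the class and method names other than the connector hooks are made up:

```python
class SchedulerSideTransferTracker:
    """Sketch of scheduler-side transfer tracking: transfers are launched
    when connector metadata is built, polled for completion afterwards,
    and the worker-side connector hooks are deliberate no-ops."""

    def __init__(self) -> None:
        self._pending: set[int] = set()
        self._next_id = 0

    def launch_task(self) -> int:
        # Called when building connector metadata: hand the transfer to
        # the external KV server and remember its task id.
        task_id = self._next_id
        self._next_id += 1
        self._pending.add(task_id)
        return task_id

    def poll_finished(self, reported_done: set[int]) -> set[int]:
        # Called when updating connector output: retire tasks the backend
        # reports as complete.
        done = self._pending & reported_done
        self._pending -= done
        return done

    # Worker-side hooks: no-ops in a scheduler-side design, kept only to
    # satisfy the connector interface.
    def start_load_kv(self, *args, **kwargs) -> None:
        pass

    def wait_for_save(self, *args, **kwargs) -> None:
        pass
```

Because all coordination happens on the scheduler side, the forward pass never blocks on per-layer loads or saves.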
No worries at all @NickLucche — apologies for the delay on my end as well. Yes, I've added unit tests for the FlexKV connector. You can find them at:
The tests mock the external
This commit introduces the FlexKV connector, enabling integration with FlexKV, a distributed KV Store and multi-level cache management system for ultra-large-scale LLM inference. Signed-off-by: phaedonsun <phaedonsun@tencent.com>
@mgoin @NickLucche Thanks for the review and suggestions. Seems the PR is ready now? Can you please approve this PR to push it forward? Thanks!
mgoin left a comment:
LGTM, thanks for iterating. Will keep the CI in the external package for now then
…llm-project#34328) Signed-off-by: phaedonsun <phaedonsun@tencent.com> Co-authored-by: phaedonsun <phaedonsun@tencent.com>
FlexKV is a distributed KV store and multi-level cache management system developed by Tencent Cloud's TACO team in collaboration with the community, designed for large-scale LLM inference scenarios. FlexKV leverages multi-level caching and a distributed KVCache pool to enable inference engines to achieve higher throughput and lower latency.
In our case, when integrated with FlexKV, we achieve the following improvements:
ISL=21K, OSL=1K, batch_size=8: TTFT decreases by 60%, TPOT increases by 13%, and QPM increases by 16%.